D² Pinball Score (d2_pinball_score)#

d2_pinball_score is a regression score that measures the fraction of pinball loss (quantile loss) explained by a model relative to a simple baseline that always predicts the empirical (\alpha)-quantile of y_true.

  • Best possible score: 1.0

  • Baseline score (constant (\alpha)-quantile predictor): 0.0

  • Can be negative (worse than the baseline)

It’s the quantile-regression analogue of (R^2): it answers “how much better is the model than the best constant quantile prediction?”


Learning goals#

  • Define pinball loss and the D² pinball score (with math)

  • Build intuition for the asymmetric penalty and the quantile baseline

  • Implement d2_pinball_score from scratch in NumPy (weights + multioutput)

  • Fit a simple linear quantile regression model with subgradient descent

  • Know pros/cons and when to use D² pinball

Quick import#

from sklearn.metrics import d2_pinball_score

Prerequisites#

  • quantiles / percentiles

  • basic regression notation

  • piecewise functions

import warnings

import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.linear_model import LinearRegression, QuantileRegressor
from sklearn.metrics import d2_pinball_score, mean_pinball_loss
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Pinball loss (quantile loss)#

For a quantile level (\alpha \in [0, 1]), define the pinball loss for one sample as:

Let (u = y - \hat{y}) (true minus prediction). Then

\[ \rho_\alpha(u) = u\,\big(\alpha - \mathbb{1}[u < 0]\big) \]

Equivalently (piecewise):

\[\begin{split} \rho_\alpha(y, \hat{y}) = \begin{cases} \alpha (y - \hat{y}) & y \ge \hat{y} \\ (1-\alpha)(\hat{y} - y) & y < \hat{y} \end{cases} \end{split}\]

The mean pinball loss over (n) samples is:

\[ L_\alpha(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \rho_\alpha(y_i, \hat{y}_i) \]

Interpretation:

  • (\alpha) controls asymmetry: for (\alpha=0.9), under-prediction ((\hat{y} < y)) is penalized 9x as much as over-prediction.

  • (\alpha=0.5) gives (\rho_{0.5}(y,\hat{y}) = \tfrac12|y-\hat{y}|), so d2_pinball_score(alpha=0.5) matches the MAE-based D² score (d2_absolute_error_score).

def pinball_loss_residual(u, alpha):
    # Pinball loss as a function of residual u = y - y_pred
    u = np.asarray(u)
    return np.where(u >= 0, alpha * u, (alpha - 1) * u)

u = np.linspace(-4, 4, 400)
alphas = [0.1, 0.5, 0.9]

fig = go.Figure()
for a in alphas:
    fig.add_trace(
        go.Scatter(
            x=u,
            y=pinball_loss_residual(u, a),
            mode="lines",
            name=f"alpha={a}",
        )
    )

fig.add_vline(x=0, line_width=1, line_dash="dash", line_color="gray")
fig.update_layout(
    title='Pinball loss is a "tilted" absolute value',
    xaxis_title="residual u = y - y_pred",
    yaxis_title="rho_alpha(u)",
    height=380,
)
fig.show()

2) The baseline and the D² pinball score#

2.1 Best constant predictor = empirical (\alpha)-quantile#

Suppose you restrict yourself to a constant prediction (c) for every sample. You minimize:

\[ J(c) = \sum_{i=1}^n \rho_\alpha(y_i, c) \]

A subgradient w.r.t. (c) is:

\[ \partial_c J(c) = \sum_{i=1}^n (\mathbb{1}[y_i < c] - \alpha) \]

So at an optimum, roughly an (\alpha) fraction of samples lie below (c). That is exactly the definition of an empirical (\alpha)-quantile.

With sample weights (w_i \ge 0), the condition becomes:

\[ \partial_c J(c) = \sum_{i=1}^n w_i(\mathbb{1}[y_i < c] - \alpha) = 0 \quad\Rightarrow\quad c = \text{weighted }\alpha\text{-quantile of }y \]

2.2 D² pinball score#

Let (L_\alpha(y, \hat{y})) be the mean pinball loss for your model predictions (\hat{y}). Let (\tilde{y}) be the baseline predictions that always equal the empirical (\alpha)-quantile of (y).

Then:

\[ D^2_\alpha(y, \hat{y}) = 1 - \frac{L_\alpha(y, \hat{y})}{L_\alpha(y, \tilde{y})} \]
  • (D^2=1) for perfect predictions (zero pinball loss)

  • (D^2=0) for the constant quantile baseline

  • (D^2<0) if your model is worse than the baseline

Edge cases (matching scikit-learn):

  • with fewer than 2 samples, the score is undefined (nan)

  • if the baseline loss is 0 (e.g. constant targets), sklearn returns:

    • 1.0 for perfect predictions

    • 0.0 otherwise

# The empirical alpha-quantile minimizes pinball loss among constant predictors
n = 250
y = rng.normal(loc=0.0, scale=1.0, size=n)
y[:15] += rng.exponential(scale=4.0, size=15)  # right-tail outliers

alpha = 0.9
q = float(np.percentile(y, alpha * 100))

c_grid = np.linspace(y.min() - 1, y.max() + 1, 300)
loss_grid = np.array([mean_pinball_loss(y, np.full_like(y, c), alpha=alpha) for c in c_grid])

c_star = float(c_grid[np.argmin(loss_grid)])
loss_star = float(loss_grid.min())

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        "Mean pinball loss vs constant prediction",
        "y distribution",
    ),
)

fig.add_trace(go.Scatter(x=c_grid, y=loss_grid, mode="lines"), row=1, col=1)
fig.add_vline(x=q, line_dash="dash", line_color="#4C78A8", row=1, col=1)
fig.add_vline(x=c_star, line_dash="dot", line_color="#E45756", row=1, col=1)
fig.add_trace(
    go.Scatter(x=[c_star], y=[loss_star], mode="markers", marker=dict(size=9, color="#E45756"), showlegend=False),
    row=1,
    col=1,
)
fig.update_xaxes(title_text="constant prediction c", row=1, col=1)
fig.update_yaxes(title_text="mean pinball loss", row=1, col=1)

fig.add_trace(go.Histogram(x=y, nbinsx=40, marker_color="rgba(0,0,0,0.65)"), row=1, col=2)
fig.add_vline(x=q, line_dash="dash", line_color="#4C78A8", row=1, col=2)
fig.update_xaxes(title_text="y", row=1, col=2)
fig.update_yaxes(title_text="count", row=1, col=2)

fig.update_layout(
    height=360,
    title=f"Baseline for D²: empirical {alpha:.1f}-quantile (q={q:.3f})",
    showlegend=False,
)
fig.show()

q, c_star
(1.412878104406565, 1.4261650061141289)

3) Interpreting D²#

Think of (L_\alpha(y,\tilde{y})) as the loss of a “no-features” model: it only knows the labels and predicts a constant quantile.

Then:

\[ D^2 = 1 - \frac{L_\alpha(\text{model})}{L_\alpha(\text{baseline})} \]

So if (D^2=0.3), your model reduces pinball loss by 30% compared to the constant-quantile baseline on that dataset.

Like (R^2), D² is mainly for comparing models on the same target/dataset; it’s not a scale-free number you can compare across unrelated problems.

# Tiny examples

y_true = np.array([1, 2, 3])
y_pred = np.array([1, 3, 3])

for a in [0.1, 0.5, 0.9]:
    print(f"alpha={a}: D2={d2_pinball_score(y_true, y_pred, alpha=a):.4f}")

alpha = 0.9
q = np.percentile(y_true, alpha * 100)
y_baseline = np.full_like(y_true, q, dtype=float)

print()
print("Baseline D2 (should be 0):", d2_pinball_score(y_true, y_baseline, alpha=alpha))
print("Perfect D2 (should be 1):", d2_pinball_score(y_true, y_true, alpha=alpha))
alpha=0.1: D2=-1.0455
alpha=0.5: D2=0.5000
alpha=0.9: D2=0.7727

Baseline D2 (should be 0): 0.0
Perfect D2 (should be 1): 1.0
# D² as predictions degrade from perfect -> noisy
n = 400
y_true = rng.standard_t(df=3, size=n)  # heavy tails

# fixed direction of noise so the curve is easier to read
noise = rng.normal(size=n)

sigmas = np.linspace(0, 2.5, 60)
alphas = [0.1, 0.5, 0.9]

fig = go.Figure()
for a in alphas:
    d2_vals = [d2_pinball_score(y_true, y_true + s * noise, alpha=a) for s in sigmas]
    fig.add_trace(go.Scatter(x=sigmas, y=d2_vals, mode="lines", name=f"alpha={a}"))

fig.add_hline(y=0, line_dash="dash", line_color="gray")
fig.add_hline(y=1, line_dash="dash", line_color="gray")
fig.update_layout(
    title="As predictions get noisier, D² pinball decreases (can go negative)",
    xaxis_title="noise scale added to perfect predictions",
    yaxis_title="D² pinball score",
    height=380,
)
fig.show()

4) From-scratch NumPy implementation#

Below is a small NumPy implementation that mirrors scikit-learn’s behavior:

  • supports 1D targets (n) and multioutput targets ((n, m))

  • supports sample_weight

  • supports multioutput aggregation:

    • 'raw_values' (per-output scores)

    • 'uniform_average' (simple mean)

    • array-like weights of shape ((m,))

  • uses the same baseline as scikit-learn:

    • unweighted: np.percentile(y_true, q=alpha*100, axis=0)

    • weighted: a lower weighted percentile (first value where the weight CDF reaches (\alpha))

Edge cases:

  • with fewer than 2 samples: returns nan

  • if the baseline loss is zero (e.g. constant targets):

    • perfect predictions → 1.0

    • imperfect predictions → 0.0

def _as_2d(y):
    y = np.asarray(y)
    if y.ndim == 1:
        return y.reshape(-1, 1)
    if y.ndim != 2:
        raise ValueError(f"y must be 1D or 2D, got shape {y.shape}")
    return y


def _weighted_percentile_lower(array, sample_weight, percentile):
    # Lower weighted percentile (mirrors sklearn.utils.stats._weighted_percentile)
    array = np.asarray(array)
    sample_weight = np.asarray(sample_weight)

    n_dim = array.ndim
    if n_dim == 0:
        return array[()]

    if array.ndim == 1:
        array = array.reshape((-1, 1))

    if sample_weight.ndim == 1:
        if sample_weight.shape[0] != array.shape[0]:
            raise ValueError("sample_weight must have shape (n_samples,) when 1D")
        sample_weight = np.tile(sample_weight.reshape(-1, 1), (1, array.shape[1]))
    elif sample_weight.shape != array.shape:
        raise ValueError("sample_weight must have same shape as array, or shape (n_samples,)")

    if np.any(sample_weight < 0):
        raise ValueError("sample_weight must be non-negative")

    sorted_idx = np.argsort(array, axis=0)
    sorted_array = np.take_along_axis(array, sorted_idx, axis=0)
    sorted_weights = np.take_along_axis(sample_weight, sorted_idx, axis=0)

    weight_cdf = np.cumsum(sorted_weights, axis=0)
    total_weight = weight_cdf[-1]
    adjusted = percentile / 100.0 * total_weight

    # For percentile=0, ignore leading observations with sample_weight=0 (sklearn behavior)
    mask = adjusted == 0
    if np.any(mask):
        adjusted = adjusted.astype(float, copy=False)
        adjusted[mask] = np.nextafter(adjusted[mask], adjusted[mask] + 1)

    idx = np.array(
        [np.searchsorted(weight_cdf[:, i], adjusted[i]) for i in range(weight_cdf.shape[1])],
        dtype=int,
    )
    idx = np.clip(idx, 0, array.shape[0] - 1)

    out = sorted_array[idx, np.arange(array.shape[1])]
    return out[0] if n_dim == 1 else out


def mean_pinball_loss_numpy(
    y_true,
    y_pred,
    *,
    alpha=0.5,
    sample_weight=None,
    multioutput="uniform_average",
):
    # NumPy implementation of mean pinball loss

    if not (0.0 <= float(alpha) <= 1.0):
        raise ValueError("alpha must be in [0, 1]")

    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")

    Y_true = _as_2d(y_true)
    Y_pred = _as_2d(y_pred)
    n_samples, n_outputs = Y_true.shape

    u = Y_true - Y_pred  # residual = y - y_pred
    loss = np.where(u >= 0, alpha * u, (alpha - 1) * u)  # always >= 0

    if sample_weight is None:
        out = loss.mean(axis=0)
    else:
        w = np.asarray(sample_weight).reshape(-1)
        if w.shape[0] != n_samples:
            raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        w_sum = w.sum()
        if w_sum == 0:
            raise ValueError("sum(sample_weight) must be positive")
        out = (loss * w.reshape(-1, 1)).sum(axis=0) / w_sum

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return out
        if multioutput == "uniform_average":
            return float(np.mean(out))
        raise ValueError("multioutput must be 'raw_values' or 'uniform_average' or an array-like")
    else:
        weights = np.asarray(multioutput).reshape(-1)
        if weights.shape[0] != n_outputs:
            raise ValueError(f"multioutput weights must have shape (n_outputs,), got {weights.shape}")
        return float(np.average(out, weights=weights))


def d2_pinball_score_numpy(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    alpha=0.5,
    multioutput="uniform_average",
):
    # NumPy implementation of scikit-learn's d2_pinball_score

    if not (0.0 <= float(alpha) <= 1.0):
        raise ValueError("alpha must be in [0, 1]")

    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"y_true and y_pred must have the same shape, got {y_true.shape} vs {y_pred.shape}")

    Y_true = _as_2d(y_true)
    Y_pred = _as_2d(y_pred)
    n_samples, n_outputs = Y_true.shape

    if n_samples < 2:
        warnings.warn("D^2 score is not well-defined with less than two samples.")
        return float("nan")

    if sample_weight is None:
        w = None
    else:
        w = np.asarray(sample_weight).reshape(-1)
        if w.shape[0] != n_samples:
            raise ValueError(f"sample_weight must have shape (n_samples,), got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        if w.sum() == 0:
            raise ValueError("sum(sample_weight) must be positive")

    numerator = mean_pinball_loss_numpy(
        Y_true,
        Y_pred,
        sample_weight=w,
        alpha=alpha,
        multioutput="raw_values",
    )

    if w is None:
        yq = np.percentile(Y_true, q=alpha * 100, axis=0)
    else:
        yq = _weighted_percentile_lower(Y_true, sample_weight=w, percentile=alpha * 100)

    y_quantile = np.tile(yq, (n_samples, 1))

    denominator = mean_pinball_loss_numpy(
        Y_true,
        y_quantile,
        sample_weight=w,
        alpha=alpha,
        multioutput="raw_values",
    )

    nonzero_numerator = numerator != 0
    nonzero_denominator = denominator != 0
    valid_score = nonzero_numerator & nonzero_denominator

    output_scores = np.ones(n_outputs, dtype=float)
    output_scores[valid_score] = 1 - numerator[valid_score] / denominator[valid_score]
    output_scores[nonzero_numerator & ~nonzero_denominator] = 0.0

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return output_scores
        if multioutput == "uniform_average":
            return float(np.mean(output_scores))
        raise ValueError("multioutput must be 'raw_values' or 'uniform_average' or an array-like")
    else:
        weights = np.asarray(multioutput).reshape(-1)
        if weights.shape[0] != n_outputs:
            raise ValueError(f"multioutput weights must have shape (n_outputs,), got {weights.shape}")
        return float(np.average(output_scores, weights=weights))
# Quick checks vs scikit-learn

alpha = 0.9

# 1D

y_true = rng.normal(size=80)
y_pred = y_true + rng.normal(0, 0.8, size=80)

print("1D")
print("numpy :", d2_pinball_score_numpy(y_true, y_pred, alpha=alpha))
print("sklearn:", d2_pinball_score(y_true, y_pred, alpha=alpha))

# Multioutput + weights
Y_true = rng.normal(size=(120, 3))
Y_pred = Y_true + rng.normal(0, 0.5, size=(120, 3))
w = rng.uniform(0.2, 2.0, size=120)

print()
print("Multioutput (raw)")
print("numpy :", d2_pinball_score_numpy(Y_true, Y_pred, sample_weight=w, alpha=alpha, multioutput="raw_values"))
print("sklearn:", d2_pinball_score(Y_true, Y_pred, sample_weight=w, alpha=alpha, multioutput="raw_values"))

print()
print("Multioutput (uniform_average)")
print("numpy :", d2_pinball_score_numpy(Y_true, Y_pred, sample_weight=w, alpha=alpha, multioutput="uniform_average"))
print("sklearn:", d2_pinball_score(Y_true, Y_pred, sample_weight=w, alpha=alpha, multioutput="uniform_average"))

# Constant-target edge case
y_const = np.ones(10)

print()
print("Constant target:")
print(
    "perfect:",
    d2_pinball_score_numpy(y_const, y_const, alpha=alpha),
    d2_pinball_score(y_const, y_const, alpha=alpha),
)
print(
    "imperfect:",
    d2_pinball_score_numpy(y_const, np.zeros_like(y_const), alpha=alpha),
    d2_pinball_score(y_const, np.zeros_like(y_const), alpha=alpha),
)
1D
numpy : -0.5638604228888011
sklearn: -0.5638604228888011

Multioutput (raw)
numpy : [-0.1124 -0.1877 -0.2139]
sklearn: [-0.1124 -0.1877 -0.2139]

Multioutput (uniform_average)
numpy : -0.17130461752532478
sklearn: -0.17130461752532478

Constant target:
perfect: 1.0 1.0
imperfect: 0.0 0.0

5) Using D² pinball while optimizing a model (from scratch)#

On a fixed dataset (y), the baseline loss (L_\alpha(y, \tilde{y})) depends only on (y) (and (\alpha)), not on the model parameters (\theta).

So:

\[ D^2_\alpha(\theta) = 1 - \frac{L_\alpha(y, \hat{y}_\theta)}{L_\alpha(y, \tilde{y})} \]

Maximizing (D^2_\alpha) is equivalent to minimizing the pinball loss:

\[ \arg\max_\theta D^2_\alpha(\theta) = \arg\min_\theta L_\alpha(y, \hat{y}_\theta) \]

Their gradients differ only by a constant scale:

\[ \nabla_\theta D^2_\alpha(\theta) = -\frac{1}{L_\alpha(y, \tilde{y})}\nabla_\theta L_\alpha(y, \hat{y}_\theta) \]

So in practice:

  • train by minimizing pinball loss (a proper training objective)

  • track D² pinball as a score on train/validation sets

Subgradient for a linear model#

Let (\hat{y}_i = x_i^\top w) and (u_i = y_i - \hat{y}_i). For one sample, a valid subgradient of (\rho_\alpha(u_i)) w.r.t. (\hat{y}_i) is:

\[\begin{split} \frac{\partial \rho_\alpha}{\partial \hat{y}_i} = \begin{cases} -\alpha & u_i > 0 \\ 1-\alpha & u_i < 0 \\ 0 & u_i = 0 \end{cases} \end{split}\]

This is enough to do (sub)gradient descent.

# Synthetic data with skewed / heteroscedastic noise
n = 350
x = rng.uniform(-2, 3, size=n)

noise = (0.6 + 0.3 * (x - x.min())) * rng.normal(size=n)
noise += (rng.random(n) < 0.15) * rng.exponential(scale=3.0, size=n)  # positive outliers

y = 1.0 + 2.0 * x + noise

X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

alpha = 0.9
def fit_linear_quantile_regression_sgd(X, y, *, alpha, lr=0.05, n_steps=600):
    # Fit y ~= b + X w by minimizing mean pinball loss with subgradient descent

    X = np.asarray(X)
    y = np.asarray(y).reshape(-1)

    n = X.shape[0]
    X_design = np.column_stack([np.ones(n), X.reshape(n, -1)])  # intercept + features

    w = np.zeros(X_design.shape[1])

    # Baseline loss for D² tracking (constant alpha-quantile)
    baseline_pred = np.full_like(y, np.percentile(y, alpha * 100), dtype=float)
    baseline_loss = mean_pinball_loss_numpy(y, baseline_pred, alpha=alpha)

    loss_hist = []
    d2_hist = []

    for _ in range(n_steps):
        y_pred = X_design @ w
        u = y - y_pred

        # subgradient d rho / d y_pred
        grad_pred = np.where(u > 0, -alpha, 1 - alpha)
        grad_pred[u == 0] = 0.0

        grad_w = (X_design.T @ grad_pred) / n
        w = w - lr * grad_w

        loss = mean_pinball_loss_numpy(y, y_pred, alpha=alpha)
        d2 = 1 - loss / baseline_loss if baseline_loss != 0 else (1.0 if loss == 0 else 0.0)

        loss_hist.append(loss)
        d2_hist.append(d2)

    return w, np.array(loss_hist), np.array(d2_hist)

w_qr, loss_hist, d2_hist = fit_linear_quantile_regression_sgd(X_train, y_train, alpha=alpha, lr=0.05, n_steps=700)

w_qr
array([3.0198, 2.2158])
# Optimization diagnostics
iters = np.arange(len(loss_hist))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Train pinball loss vs iteration", "Train D² pinball vs iteration"),
)

fig.add_trace(go.Scatter(x=iters, y=loss_hist, mode="lines"), row=1, col=1)
fig.update_xaxes(title_text="iteration", row=1, col=1)
fig.update_yaxes(title_text="L_alpha", row=1, col=1)

fig.add_trace(
    go.Scatter(x=iters, y=d2_hist, mode="lines", line=dict(color="#4C78A8")),
    row=1,
    col=2,
)
fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=2)
fig.update_xaxes(title_text="iteration", row=1, col=2)
fig.update_yaxes(title_text="D²", row=1, col=2)

fig.update_layout(height=360, title=f"Subgradient descent for alpha={alpha}")
fig.show()
# Compare to OLS (mean regression), evaluated at the same alpha

# Design matrices
X_train_design = np.column_stack([np.ones(len(X_train)), X_train])
X_test_design = np.column_stack([np.ones(len(X_test)), X_test])

# Our quantile regressor

y_test_pred_qr = X_test_design @ w_qr

# OLS (targets conditional mean)
ols = LinearRegression().fit(X_train, y_train)
y_test_pred_ols = ols.predict(X_test)

print(f"Test D² pinball (alpha={alpha}):")
print("  quantile GD :", d2_pinball_score(y_test, y_test_pred_qr, alpha=alpha))
print("  OLS (mean)  :", d2_pinball_score(y_test, y_test_pred_ols, alpha=alpha))

print()
print("Test mean pinball loss:")
print("  quantile GD :", mean_pinball_loss(y_test, y_test_pred_qr, alpha=alpha))
print("  OLS (mean)  :", mean_pinball_loss(y_test, y_test_pred_ols, alpha=alpha))

print()
print("Coverage P(y <= y_pred):")
print("  quantile GD :", float(np.mean(y_test <= y_test_pred_qr)))
print("  OLS (mean)  :", float(np.mean(y_test <= y_test_pred_ols)))

# Fit visualization
x_line = np.linspace(x.min(), x.max(), 250)
X_line = x_line.reshape(-1, 1)
X_line_design = np.column_stack([np.ones(len(x_line)), X_line])

y_line_qr = X_line_design @ w_qr
y_line_ols = ols.intercept_ + ols.coef_[0] * x_line

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", marker=dict(size=6, opacity=0.55)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_ols, mode="lines", name="OLS (mean)", line=dict(color="gray", dash="dash")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_qr, mode="lines", name=f"Quantile GD (alpha={alpha})", line=dict(color="#4C78A8", width=3)))

fig.update_layout(
    title="Linear quantile regression: fitted line vs OLS",
    xaxis_title="x",
    yaxis_title="y",
    height=420,
)
fig.show()
Test D² pinball (alpha=0.9):
  quantile GD : 0.26390620369117923
  OLS (mean)  : -0.1955308060885197

Test mean pinball loss:
  quantile GD : 0.35437222526113604
  OLS (mean)  : 0.5755556075140252

Coverage P(y <= y_pred):
  quantile GD : 0.9090909090909091
  OLS (mean)  : 0.5795454545454546

6) Practical usage (scikit-learn)#

d2_pinball_score is most useful when you care about quantiles rather than the mean:

  • prediction intervals / uncertainty bands (e.g. 10th and 90th percentile)

  • asymmetric costs (under-predicting is worse than over-predicting, or vice versa)

  • risk metrics like Value-at-Risk (VaR)

In scikit-learn you typically pair it with a quantile model such as QuantileRegressor:

from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import d2_pinball_score

alpha = 0.9
model = QuantileRegressor(quantile=alpha, alpha=0.0).fit(X_train, y_train)
y_pred = model.predict(X_test)
d2_pinball_score(y_test, y_pred, alpha=alpha)

For cross-validation you can build a scorer:

from sklearn.metrics import make_scorer
scorer = make_scorer(d2_pinball_score, greater_is_better=True, alpha=0.9)
# scikit-learn QuantileRegressor (solves a convex optimization problem)
alpha_level = 0.9

qr = QuantileRegressor(quantile=alpha_level, alpha=0.0, solver="highs").fit(X_train, y_train)
y_test_pred_qr_sklearn = qr.predict(X_test)

d2 = d2_pinball_score(y_test, y_test_pred_qr_sklearn, alpha=alpha_level)
loss = mean_pinball_loss(y_test, y_test_pred_qr_sklearn, alpha=alpha_level)

(d2, loss, qr.coef_, qr.intercept_)
(0.245096850405492, 0.36342747394411146, array([2.2703]), 3.271546583795644)

Pros, cons, and when to use D² pinball#

Pros#

  • Quantile-aware: aligns evaluation with quantile regression and asymmetric costs

  • Baseline-relative: 0 means “no better than predicting the constant (\alpha)-quantile”

  • Robust-ish: pinball loss grows linearly in the residual magnitude (less outlier-sensitive than squared loss)

  • Works for intervals: evaluate different quantiles (e.g. 0.1 and 0.9) separately

Cons / pitfalls#

  • Requires choosing (\alpha): different quantiles answer different questions

  • Not scale-free across problems: like MAE, the underlying loss depends on the units of (y)

  • Can be negative: models can be arbitrarily worse than the constant-quantile baseline

  • Non-smooth objective: training involves subgradients or specialized solvers (LP, coordinate descent, etc.)

Good use cases#

  • forecasting with prediction intervals (demand, energy, latency, finance)

  • service-level problems (e.g. “allocate enough capacity so that 90% of days we’re covered”)

  • risk-sensitive settings (VaR-style quantiles)

Exercises#

  1. Plot mean pinball loss vs constant prediction (c) for several (\alpha) values and verify the minimizer moves from low to high quantiles.

  2. Fit two models for (\alpha=0.1) and (\alpha=0.9); plot the resulting prediction interval and estimate empirical coverage.

  3. Create a dataset where D² pinball is strongly negative and interpret it in terms of numerator vs denominator.

  4. Show numerically that alpha=0.5 makes pinball loss proportional to MAE.

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.d2_pinball_score.html

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_pinball_loss.html

  • Koenker & Machado (1999): Goodness of fit for quantile regression (Eq. 7): https://doi.org/10.1080/01621459.1999.10473882

  • Hastie, Tibshirani, Wainwright (2015): Statistical Learning with Sparsity (pinball deviance discussion): https://hastie.su.domains/StatLearnSparsity/